source: drawn by author
| term | definition |
| --- | --- |
| posterior | \(p(\theta \mid \mathcal{D}) = \tfrac{1}{Z} \,p(\mathcal{D} \mid \theta) \, p(\theta)\) |
| likelihood | \(p(\mathcal{D} \mid \theta)\) |
| prior | \(p(\theta)\) |
| evidence | \(Z := p(\mathcal{D}) = \textstyle\int p(\mathcal{D} \mid \theta) \, p(\theta) \,d\theta\) |
min loss = max likelihood = max posterior (if prior uniform)
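A one-line check of this chain, using the definitions above and assuming the loss is the negative log-likelihood, \(\mathcal{L}(\mathcal{D};\theta) = -\log p(\mathcal{D} \mid \theta)\):

\[ \arg\min_\theta \mathcal{L}(\mathcal{D};\theta) = \arg\max_\theta p(\mathcal{D} \mid \theta) = \arg\max_\theta \tfrac{1}{Z}\, p(\mathcal{D} \mid \theta)\, p(\theta) = \arg\max_\theta p(\theta \mid \mathcal{D}), \]

where the last two equalities hold because \(Z\) and a uniform \(p(\theta)\) are constant in \(\theta\).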
source: altered from Amini et al. (2019)
| term | definition |
| --- | --- |
| posterior | \(p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta; \mu, \varSigma)\) |
| centered at | \(\mu := \theta_\text{MAP}\) |
| with covariance | \(\varSigma := H^{-1}\) |
| where | \(H = \nabla^2_\theta \mathcal{L}(\mathcal{D};\theta) \,\vert_{\theta_\text{MAP}}\) |
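A minimal sketch of this recipe on a toy linear-regression problem (the data, prior precision, and noise level are all hypothetical): find \(\theta_\text{MAP}\) by gradient descent on the loss, then set \(\varSigma = H^{-1}\) with \(H\) the Hessian of the loss at the mode.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + 1 + noise.
X = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * X + 1 + 0.1 * torch.randn_like(X)
Phi = torch.cat([X, torch.ones_like(X)], dim=1)  # design matrix [x, 1]

prior_prec = 1.0     # assumed Gaussian prior p(theta) = N(0, I / prior_prec)
noise_var = 0.1**2   # assumed observation-noise variance

def loss(theta):
    # L(D; theta) = -log p(D|theta) - log p(theta), up to additive constants
    resid = y - Phi @ theta.unsqueeze(1)
    return 0.5 * (resid**2).sum() / noise_var + 0.5 * prior_prec * (theta**2).sum()

# 1) theta_MAP = argmin L(D; theta)
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss(theta).backward()
    opt.step()
theta_map = theta.detach()

# 2) H = grad^2_theta L(D; theta) at theta_MAP, and Sigma = H^{-1}
H = torch.autograd.functional.hessian(loss, theta_map)
Sigma = torch.linalg.inv(H)
print("theta_MAP:", theta_map, "\nSigma:\n", Sigma)
```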
Posterior predictive: the probability of \(y\) at a new input \(x_*\), obtained by averaging the likelihood \(p(y \mid f_\theta(x_*))\) over the posterior.
\[ p(y \mid x_*, \mathcal{D}) = \int p(y \mid f_\theta(x_*)) \, p(\theta \mid \mathcal{D}) \,d\theta \]
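This integral is intractable for most networks; the usual fallback is Monte Carlo over posterior samples. A sketch continuing the toy example above (it reuses `theta_map`, `Sigma`, and `noise_var` from that block; the query point is hypothetical):

```python
import torch
from torch.distributions import MultivariateNormal

# theta_s ~ N(theta_MAP, Sigma), then average p(y | f_{theta_s}(x_*)) over samples.
posterior = MultivariateNormal(theta_map, covariance_matrix=Sigma)
theta_samples = posterior.sample((1000,))        # shape (S, 2)

x_star = torch.tensor([0.5, 1.0])                # features [x, 1] of a new input
f_samples = theta_samples @ x_star               # f_{theta_s}(x_*), shape (S,)

pred_mean = f_samples.mean().item()
pred_var = f_samples.var().item() + noise_var    # epistemic + aleatoric variance
print(f"p(y | x_*, D) ≈ N({pred_mean:.2f}, {pred_var:.3f})")
```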
\[ (H_{f})_{i,j}={\frac {\partial ^{2}f}{\partial x_{i}\,\partial x_{j}}} \]
\[ H_{f} = \begin{bmatrix} {\dfrac {\partial ^{2}f}{\partial x_{1}^{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{n}}}\\[2.2ex] \vdots &\ddots &\vdots \\[2.2ex] {\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{1}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{n}^{2}}} \end{bmatrix} \]
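A quick numerical check of this definition with autograd (the test function \(f(x) = x_1^2 x_2 + x_2^3\) is just a hypothetical example):

```python
import torch

def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 3

x = torch.tensor([1.0, 2.0])
H = torch.autograd.functional.hessian(f, x)
# Analytic Hessian: [[2*x2, 2*x1], [2*x1, 6*x2]] = [[4, 2], [2, 12]] at (1, 2)
print(H)
```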
\[ \begin{aligned} F &:= \textstyle\sum_{n=1}^N \mathbb{E}_{\widehat{y} \sim p(y \mid f_\theta(x_n))} \left[ g g^\intercal \right] \\ g &= \nabla_\theta \log p(\widehat{y} \mid f_\theta(x_n)) \,\big\vert_{\theta_\text{MAP}} \end{aligned} \]
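A sketch of estimating \(F\) by Monte Carlo for a softmax classifier; `model` and `loader` are hypothetical stand-ins. Note that \(\widehat{y}\) is sampled from the model's own predictive distribution, not read from the data (plugging in the data labels instead gives the "empirical Fisher").

```python
import torch

def mc_fisher(model, loader, n_mc=1):
    """F = sum_n E_{y_hat ~ p(y | f_theta(x_n))} [g g^T],
    g = grad_theta log p(y_hat | f_theta(x_n))."""
    params = [p for p in model.parameters() if p.requires_grad]
    dim = sum(p.numel() for p in params)
    F = torch.zeros(dim, dim)
    for x, _ in loader:                 # true labels are unused: y_hat is sampled
        for xi in x:                    # per-example gradients, no batching
            log_probs = torch.log_softmax(model(xi.unsqueeze(0)), dim=-1)
            for _ in range(n_mc):
                y_hat = torch.multinomial(log_probs.exp().squeeze(0), 1).item()
                g = torch.autograd.grad(log_probs[0, y_hat], params,
                                        retain_graph=True)
                g = torch.cat([gi.reshape(-1) for gi in g])
                F += torch.outer(g, g) / n_mc
    return F
```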
\[ \begin{aligned} G &:= \textstyle\sum_{n=1}^N J(x_n) \left( -\nabla^2_{f} \log p(y_n \mid f) \,\big\vert_{f=f_{\theta_\text{MAP}}(x_n)} \right) J(x_n)^\intercal \\ J(x_n) &:= \nabla_\theta f_\theta(x_n) \,\big\vert_{\theta_\text{MAP}} \end{aligned} \]
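A matching sketch of the GGN for a softmax classifier, where \(-\nabla^2_f \log p(y \mid f) = \operatorname{diag}(p) - p p^\intercal\) happens not to depend on \(y\) (`model` and `loader` are again hypothetical; the code builds \(J\) with one row per output, so \(J \varLambda J^\intercal\) appears as `J.T @ Lam @ J`):

```python
import torch

def ggn(model, loader):
    """G = sum_n J(x_n) (-grad^2_f log p(y_n | f)) J(x_n)^T at theta_MAP."""
    params = [p for p in model.parameters() if p.requires_grad]
    dim = sum(p.numel() for p in params)
    G = torch.zeros(dim, dim)
    for x, _ in loader:
        for xi in x:
            f = model(xi.unsqueeze(0)).squeeze(0)    # logits f_theta(x_n), shape (C,)
            p = torch.softmax(f, dim=-1).detach()
            Lam = torch.diag(p) - torch.outer(p, p)  # -d^2 log-likelihood / df^2
            rows = []                                # Jacobian, one row per logit
            for c in range(f.numel()):
                g = torch.autograd.grad(f[c], params, retain_graph=True)
                rows.append(torch.cat([gi.reshape(-1) for gi in g]))
            J = torch.stack(rows)                    # shape (C, dim)
            G += J.T @ Lam @ J
    return G
```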
source: altered from Daxberger et al. (2022)
source: Daxberger et al. (2022)
source: Martens and Grosse (2020)
* stock images generated by Stable Diffusion
s bratus [at] student tudelft nl